SUMMARY


The application of decision theory often involves assessing subjective
probabilities, and procedures for assessing them are quite well
developed. But such procedures are based on assessments by a single
person. Often multiple individuals are called on to provide the
probabilistic judgments. Unanimity in judgments among the multiple
individuals cannot be expected, thereby creating the problem of how to
arrive at a single probability distribution that can be used in applying
decision theory.

Two general approaches to this problem exist. The individuals can
interact as a group to reach a consensus, or the individual judgments
can be mathematically aggregated to produce a single probability
distribution. Each of these approaches has advantages and disadvantages.
Group interaction allows the exchange of information, but may be
susceptible to dominance by certain individuals or pressure for
conformity. Mathematical aggregation is simple to use and ensures that a
single distribution will result, but theoretical difficulties are
encountered in specifying an appropriate aggregation model.

Using several forms of group interaction and mathematical aggregation
models, this research investigated the quality of probabilities produced
by interaction versus mathematical models, and by the various forms of
interaction and various mathematical models. "Quality" was measured by
proper scoring rules, calibration, and extremeness on two types of
probability assessments: discrete assessments for two-alternative
questions and beta probability density functions for questions about
percentages. Ten four-person groups composed primarily of graduate
students assessed probabilities for twenty questions of each type in
each of five types of group interaction: no interaction, Delphi, Nominal
Group Technique (NGT), a mix of Delphi and NGT, and discussion to
consensus. The mathematical models used to aggregate the individual
assessments included the linear model, the weighted geometric mean, and
the pari-mutuel model for discrete assessments; and the linear model and
conjugate model for densities; each with various weighting procedures.

Applying proper scoring rules to the group probabilities indicated
that simple mathematical aggregation without any interaction, e.g.,
linear aggregation with equal weights, generally produced group
probabilities as good as those assessed after interaction. Interaction
did produce more extreme but less well calibrated assessments, with the
type of interaction having little effect. Generally, the calibration of
mathematically aggregated group probabilities prior to any interaction
was quite good, clearly better than the calibration of individual
assessments.


These results may appear relatively uninteresting from a psychological
perspective because of the lack of differences in assessments after
different types of interaction. But the implications for applications of
decision theory are important. In many instances, simple mathematical
aggregation of individual probability assessments may be adequate
without resorting to more elaborate, practically difficult, and
time-consuming interactive processes or modeling efforts.

One of the cornerstones of decision theory is the concept of
subjective probability. The theory of subjective probability (e.g.,
Savage, 1954) provides a basis for quantifying the subjective opinions
of a decision maker, or of experts whose opinions are used by a decision
maker, in probabilistic terms which can then be used explicitly in the
decision making process. In order to use subjective probabilities,
techniques have been developed for assessing them (Spetzler and Stael
von Holstein, 1975). The development of the theory and the assessment
techniques has led to the use of subjective probability in a wide
variety of real decision contexts (Beach, 1976).

But the applications of subjective probabilities in real-world
contexts have also illuminated a gap between the theory and assessment
techniques, and the technology needed in certain decision situations.
Often groups rather than individuals are the decision makers or the
experts providing input to the decision makers. And research has shown
that the type of judgments required are generally more valid when made
by groups rather than individuals (Seaver, 1976). Yet both the theory
and the assessment techniques of subjective probability have been
primarily oriented toward quantifying the uncertainty of a single
individual. Although, as Savage (1954, p. 8) points out, the theory is
not limited to the single person case, extensions to the multi-person
case depend on some sort of unanimity of action among the group members.
Such unanimity rarely exists in decision making groups until some
process specifically aimed at achieving it is undertaken.

One possible way in which to create a form of unanimity is for
the group to interact to reach a consensus. But social psychological
research suggests that several aspects of the interaction process may
reduce the quality of the resulting consensus (Collins and Guetzkow,
1964; Davis, 1969; Van de Ven and Delbecq, 1971). For example,
interacting groups will often expend considerable time and effort simply
structuring the group and the interacting process, both explicitly and
unknowingly. Additionally, dominance by individuals because of status
or personality may decrease the effectiveness of the group. Or pressure
for conformity may cause the group, in effect, to make simply reaching
an agreement more important than the substantive value of the consensus.

Elaborate interactive processes that attempt to circumvent
these factors have been the subject of extensive research. Typically,
such processes rely on strictly controlled interaction and do not
actually produce a consensus, but rather necessitate some type of
aggregation of individual judgments to produce the group judgment.
Since these processes are often quite time-consuming and their
effectiveness is questionable, simpler approaches to the problem of
determining group probabilities should be considered.

One obvious simple approach is to average the individual
probabilities, or to use some other mathematical aggregation rule.
Theoretical difficulties with mathematical aggregation do exist,
however, as shown by Dalkey (1972). He proved, in the spirit of Arrow's
(1951) Impossibility Theorem, that there is no rule for aggregating
individual probabilities into a group probability distribution that
satisfies a set of seemingly reasonable conditions. Additionally, the
more rigorous and theoretically appealing mathematical aggregation
models are difficult to apply in practice because an inordinate amount
of data or extremely complex judgments are required as inputs to the
models. Simplifying, although unrealistic, assumptions can be made that
allow use of these models.

Although it has some problems, mathematical aggregation of
individual probabilities does have two advantages over interaction:
the group probability will always be produced, and it will be obtained
using less of the decision makers' or experts' time. Whether or not
mathematical aggregation should generally be advocated for obtaining
group probabilities depends on two, probably related, factors: the
quality of the resulting probabilities, and the acceptability of the
procedure to the group. In fact, should the group agree to use some
mathematical aggregation rule to determine the group probabilities, it
is in effect producing the unanimity necessary for the theory of
subjective probability.

Thus, the question of what is the best way to reach unanimity
is an empirical question. Will the quality of group probabilities
produced by mathematical aggregation of individual probabilities be
good enough so that such a procedure can be advocated rather than the
much more cumbersome interaction processes? If so, what mathematical
model should be used for aggregation? If not, is there a specific
interactive process that works best? The experimental research reported
here explores the answers to these questions.

However, before describing the experiment and presenting the
results, some additional information is presented. First, several
concepts concerning probability and its use in this research are
defined and explained. Then the specific nature of the different types
of both interaction processes and mathematical aggregation models is
described, along with the scant empirical research on the relative
merits of the various means of determining group probabilities.
Subsequently, the experiment and the obtained results are presented.
And, finally, the implications of the research for groups faced with the
task of determining probabilities are discussed with special emphasis on
applications in realistic situations.


CONCEPTS  IN  ASSESSING  AND  EVALUATING 
SUBJECTIVE  PROBABILITIES 

Assessing  Subjective  Probabilities 

Procedures for both assessing and evaluating subjective
probabilities depend on the nature of the propositions or events for
which probabilities are assessed. If the events under consideration are
discrete (that is, the space of possible events is represented by a
finite number of mutually exclusive and exhaustive events), then
assessments can take the form of a probability between 0.0 and 1.0.
If, however, the events are represented by a continuum with an infinite
number of possibilities, then the assessments must be probability
density functions. Procedures for eliciting probability density
functions often produce only approximations (cf. Seaver, von
Winterfeldt, and Edwards, 1978). Spetzler and Stael von Holstein (1975)
discuss particular procedures for eliciting appropriate judgments for
both types of assessments.

When complete probability density functions are needed, often
a particular family of distributions (e.g., normal or beta distributions)
can provide enough flexibility by varying parameters to represent
subjective opinion. This is especially useful in certain instances when
information from a variety of sources is to be combined; e.g., subjective
prior probability with objective data, or, in some instances, multiple
subjective prior probabilities. Bayes' Theorem provides the appropriate
mechanism for combining information. If the information being combined
is represented by distributions that are members of a conjugate family
of distributions, the distribution resulting from the application of
Bayes' Theorem will also be a member of the same family of distributions
(DeGroot, 1970). For example, beta distributions can be combined to
produce another beta distribution, or combining normal distributions
produces a normal distribution. And use of conjugate distributions
greatly simplifies the computation necessary in applying Bayes' Theorem.
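
The computational simplification can be illustrated with the case of a
subjective beta prior combined with objective data. The sketch below is
illustrative only: the function names and the numbers are assumptions, and
it uses the standard beta-binomial update, in which observing successes
and failures simply adds to the two beta parameters.

```python
def update_beta(a, b, successes, failures):
    """Posterior Beta(a, b) parameters after binomial data (conjugate update)."""
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior opinion about a percentage, expressed as Beta(3, 7).
prior = (3.0, 7.0)

# Hypothetical objective data: 12 occurrences in 20 trials.
posterior = update_beta(*prior, successes=12, failures=8)

print("posterior parameters:", posterior)
print("prior mean:", beta_mean(*prior), "posterior mean:", beta_mean(*posterior))
```

No integration is needed: the posterior stays in the beta family, so
Bayes' Theorem reduces to arithmetic on the parameters.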

Evaluating  Subjective  Probabilities

In  a philosophical  sense,  subjective  probabilities  by  their  very 
nature  cannot  be  externally  evaluated.  They  are  judgments  or  opinions, 
and  as  such  can  only  be  evaluated  in  terms  of  how  well  the  elicited 
judgment  represents  the  internal  opinion.  But  in  a practical  sense, 
certain  criteria  characterize  properties  subjective  probabilities  should 
have.  Seaver,  von  Winterfeldt,  and  Edwards  (1978)  have  identified  five 
such  desiderata: 

1. Assessments should be consistent with the laws of probability
theory.

2. Assessments should be extreme. For discrete assessments,
this implies that probabilities assigned to events that occur should be
near 1.0, while those assigned to non-occurring events should be near
0.0. Continuous assessments should have a high density at the true
value and a density near 0.0 elsewhere.

3. Assessments should be well-calibrated. This means that
multiple assessments should have the property that the events for which
the probabilities are assessed occur with a relative frequency equal to
the assessed probability. For example, discrete events for which the
assessed probability is .75 should occur about 75 percent of the time.
And about 50 percent of the true values should fall below the medians
of assessed probability densities, or within the interquartile ranges.

4. Assessments should produce high scores when evaluated with
proper scoring rules (see Murphy and Winkler, 1970; Stael von Holstein,
1971). These scores measure a combination of criteria 2 and 3, which
typically will conflict. The defining property of proper scoring rules
is that the expected value of the score is maximized if and only if the
assessor reports his or her true opinion. An often used proper scoring
rule for discrete assessments is the quadratic scoring rule:

$$S_k = 2p(\theta_k) - \sum_{j=1}^{n} p(\theta_j)^2 \qquad (1)$$

where $S_k$ is the score if $\theta_k$ occurs. The continuous form of
the ranked probability score is an example of a proper scoring rule for
continuous assessments (Matheson and Winkler, 1976):

$$S^* = -\int_{-\infty}^{t} P(\theta)^2 \, d\theta - \int_{t}^{\infty} \bigl(1 - P(\theta)\bigr)^2 \, d\theta \qquad (2)$$

where $P(\theta)$ is the cumulative assessed distribution and t is the
true value of $\theta$.

5. Assessments should be responsive to evidence. Seaver
et al. suggest this means probabilities should be revised as evidence
accumulates as specified by Bayes' Theorem. In a formal sense, this
follows from the laws of probability (criterion 1).
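
The defining property of a proper scoring rule (item 4) can be checked
numerically for the quadratic rule. The following is a minimal sketch under
assumed values: the belief of .7 and the grid of candidate reports are
hypothetical, and the function names are introduced here for illustration.

```python
def quadratic_score(p, k):
    """Quadratic score 2*p[k] - sum_j p[j]**2 when event k occurs."""
    return 2.0 * p[k] - sum(q * q for q in p)


def expected_score(report, belief):
    """Expected quadratic score of `report` under the assessor's true `belief`."""
    return sum(belief[k] * quadratic_score(report, k) for k in range(len(belief)))


# Hypothetical true opinion on a two-alternative question.
belief = [0.7, 0.3]

# Try every report r = 0.00, 0.01, ..., 1.00 for the first alternative.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda r: expected_score([r, 1.0 - r], belief))

print("report maximizing expected score:", best)
```

Under these assumptions the expected score peaks when the report equals
the belief, which is what properness requires.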

In any given situation, probability assessments are usually
not evaluated using all five desiderata. Elicitation procedures often
do not allow properties 1 or 5 to be violated. Most investigations of
procedures for assessing subjective probabilities have focused on
properties 3 and 4. Lichtenstein, Fischhoff, and Phillips (1977) have
reviewed the research on the calibration of (individual) assessments,
most of which indicated assessments are usually not well-calibrated.
Scores tend to vary depending on the assessor's expertise and training
(cf. Stael von Holstein, 1971, 1972; Winkler, 1971), but often scores are
only slightly better than would be achieved with uniform probabilities.
Thus, clearly, assessments can be improved, and using multiple persons
is a possible means of improvement.
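
Calibration of discrete assessments can be examined by grouping
assessments by stated probability and comparing each stated value with the
observed relative frequency. The sketch below assumes a hypothetical set
of fifteen assessments; the data and the function name are illustrative
and are not results from this research.

```python
from collections import defaultdict


def calibration_table(assessments):
    """Map each stated probability to (count, observed relative frequency)."""
    outcomes = defaultdict(list)
    for stated, occurred in assessments:
        outcomes[stated].append(1 if occurred else 0)
    return {p: (len(o), sum(o) / len(o)) for p, o in sorted(outcomes.items())}


# Hypothetical assessments: (stated probability, whether the event occurred).
data = (
    [(0.9, True)] * 8 + [(0.9, False)] * 2
    + [(0.6, True)] * 3 + [(0.6, False)] * 2
)

table = calibration_table(data)
for stated, (count, frequency) in table.items():
    print(f"stated {stated:.2f}: n={count}, observed {frequency:.2f}")
```

In this made-up data the assessor is well calibrated at .6 but
overconfident at .9, where only 80 percent of the events occur.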


ASSESSMENT  APPROACHES 


Mathematical  Aggregation  Procedures 

A variety of mathematical models for combining individual
probabilities into a composite or group probability have been suggested.
Depending upon the particular model, these models may be applicable for
aggregating either discrete probabilities or density functions, or
both. Some are quite simple mathematically, although the underlying
theoretical justification may be quite complex, while others are quite
complicated and often unusable in realistic situations. Although
unusable in their general form, these more complex models are still
practically beneficial because they can be simplified with certain
assumptions.

Weighted linear combination. This procedure, sometimes called
the "opinion pool," can be used with both discrete probabilities and
density functions. It takes the form

$$p_G(\theta) = \sum_{i=1}^{m} w_i \, p_i(\theta) \qquad (3)$$

where $p_G$ is the group probability (density function) and $w_i$ and
$p_i$ are the weight and the probability (density) respectively of
individuals $i = 1, \ldots, m$. Stone (1961) was the first to present a
formal justification for this model when, assuming a convex utility
function common to all individuals, he proved the rather weak result
that the utility of the decision made on the basis of an opinion pool
was greater than or equal to the minimum utility of a decision based on
the probability distribution of any individual. Stronger results were
obtained by Bacharach (1975) using stronger assumptions. Again, given a
common utility function, and a group preference ordering satisfying
forms of independence of irrelevant alternatives and Pareto Optimality,
along with a couple of technical assumptions, the group maximizes
expected utility given a probability distribution in the form of a
linear combination of the individual probability distributions.
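
As an illustration of the opinion pool in equation (3), the following
sketch pools discrete assessments from four hypothetical assessors. The
function name, the default to equal weights, and the numbers are
assumptions made for the example.

```python
def linear_pool(distributions, weights=None):
    """Weighted arithmetic mean of discrete distributions (equation 3)."""
    m = len(distributions)
    if weights is None:
        weights = [1.0 / m] * m  # equal weights when none are supplied
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    n = len(distributions[0])
    return [sum(w * p[j] for w, p in zip(weights, distributions)) for j in range(n)]


# Hypothetical assessments for a two-alternative question.
experts = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2]]

equal = linear_pool(experts)
unequal = linear_pool(experts, weights=[0.4, 0.2, 0.2, 0.2])

print("equal weights:  ", equal)
print("unequal weights:", unequal)
```

Because the weights sum to one and each input is a probability
distribution, the pooled values again sum to one.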

DeGroot (1974) has taken a different approach to formalizing
the justification for weighted linear combination of probabilities.
Individuals are assumed to revise their own probabilities as weighted
linear combinations of the revealed probabilities of other group members.
In a group with m individuals, each individual i assigns weight $w_{ij}$
to individual j, with all $w_{ij} \ge 0$ and $\sum_{j=1}^{m} w_{ij} = 1$
for all i. This revision process is assumed to be iterative with a
constant matrix of weights W and a vector of initial individual
probability distributions, P, with elements $p_1, \ldots, p_m$. Then,
after n iterations the vector of probabilities is

$$P^{(n)} = W P^{(n-1)} = W^n P.$$

The elements of $P^{(n)}$ will converge to the same limit, i.e., a
consensus is reached, if and only if there is a vector
$W^* = (w_1^*, \ldots, w_m^*)$ such that

$$\lim_{n \to \infty} w_{ij}^{(n)} = w_j^*$$

for all i and j, where $w_{ij}^{(n)}$ is an element of $W^n$. DeGroot
proved that $W^*$ exists if there is at least one person in the group who
receives nonzero weights from all group members. The elements of $W^*$
can be found by solving the set of linear equations $W^* W = W^*$ subject
to the constraint

$$\sum_{j=1}^{m} w_j^* = 1.$$

The group probability distribution is then the linear combination of the
initial individual assessments weighted by the $w^*$'s.
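
The iterative revision process can be sketched by raising W to successive
powers until every row is the same. The weight matrix and initial
probabilities below are hypothetical, and the helper names are introduced
only for this example; the stopping tolerance is likewise an assumption.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def consensus_weights(W, tol=1e-12, max_iter=10000):
    """Return the common row of W**n once all rows agree to within `tol`."""
    power = [row[:] for row in W]
    for _ in range(max_iter):
        spread = max(max(col) - min(col) for col in zip(*power))
        if spread < tol:
            return power[0]
        power = mat_mul(power, W)
    raise ValueError("rows of W**n did not converge; no consensus found")


# Hypothetical weights: row i holds the weights individual i gives to each member.
W = [[0.50, 0.50],
     [0.25, 0.75]]
initial = [0.9, 0.6]  # hypothetical initial probabilities for one event

weights = consensus_weights(W)
consensus = sum(w * p for w, p in zip(weights, initial))

print("limiting weights:", weights)
print("consensus probability:", consensus)
```

For this matrix the limiting weights are 1/3 and 2/3, which also satisfy
the linear equations and the normalization constraint directly.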

One specific advantage of the DeGroot formulation is that it
explicitly reveals how weights are to be determined. Other
justifications leave this question completely open. Several procedures
for assigning weights have been suggested and empirically tested,
however; they will be discussed later since they pertain to other
aggregation methods as well as the linear combination.

The linear combination is the only mathematical aggregation rule
that has received much empirical attention as a means of generating
composite probabilities. Several studies have shown that weighted linear
combinations of individual probabilities are generally superior to
individual assessments as evaluated by proper scoring rules (Brown, 1973;
Stael von Holstein, 1971, 1972; Winkler, 1971). However, since proper
scoring rules are concave functions on the probability simplex, the
score of the average of individual probabilities will necessarily be
at least as good as the average of the individuals' scores. Nevertheless,
the evidence is quite striking since usually only 10 percent or fewer of
the individual subjects out-perform the group assessments.
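
The concavity argument can be seen in a small numerical case using the
quadratic scoring rule. The two assessors and their probabilities are
hypothetical; the sketch only shows that the score of the averaged
assessment is no worse than the average of the individual scores,
whichever event occurs.

```python
def quadratic_score(p, k):
    """Quadratic score 2*p[k] - sum_j p[j]**2 when event k occurs."""
    return 2.0 * p[k] - sum(q * q for q in p)


# Hypothetical assessments from two individuals for a two-alternative question.
experts = [[0.9, 0.1], [0.5, 0.5]]
pooled = [sum(p[j] for p in experts) / len(experts) for j in range(2)]

results = {}
for k in range(2):
    mean_of_scores = sum(quadratic_score(p, k) for p in experts) / len(experts)
    score_of_mean = quadratic_score(pooled, k)
    results[k] = (score_of_mean, mean_of_scores)
    print(f"event {k}: score of average {score_of_mean:.2f}, "
          f"average of scores {mean_of_scores:.2f}")
```

This is why the comparison against the proportion of individuals who
out-perform the group is the more informative evidence.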

Other evaluations also argue for the superiority of weighted
linear combinations. Winkler (1971) made hypothetical bets based on
both individual and weighted linear combinations of individual probability
assessments for football game winners. For various betting schemes, bets
based on the weighted linear combinations won from 2¢ to 47¢ more per
dollar bet than did bets based on individual assessments. This economic
evaluation is rather impressive support for weighted linear
combinations of individual probability assessments.

Bayesian models and approximations. Since probability
assessments may be considered information pertaining to a set of
hypotheses, a natural procedure for combining such assessments would be
to use Bayes' Theorem, the formally correct procedure for combining
probabilistic information. Somewhat similar treatments of this problem
have been suggested by Dalkey (1975) and Morris (1974, 1977).

Morris derived results applicable from the point of view of a
decision maker faced with the task of combining the probabilistic
judgments of multiple experts with his or her own judgment. However, with
some very minor adjustments, his model is applicable to the general
problem of combining probabilistic judgments. Drawing on Bayes' Theorem,
the most general form of the model is

$$p_G(\theta) = k \cdot C(\theta) \cdot p_1(\theta) \cdots p_m(\theta) \cdot p_0(\theta)$$

where k is a normalization constant; $p_0$ is the prior probability, in
most cases probably assumed to be uniform, but possibly derived from
other sources, e.g., historical data; and $p_i(\theta)$ is the
distribution assessed by expert i. $C(\theta)$, the "Joint Calibration
Function" (Morris, 1977), reflects both the lack of independence among
the individual judgments, in the sense that knowing the judgment of one
individual provides information about the probable judgment of another
individual, and the lack of calibration of the individual judgments. This
function is generally impractical to derive because of the necessity
for inordinate amounts of data or very complex judgments. Therefore,
simplifying assumptions must be made to utilize this model.

The same problem occurs with Dalkey's (1975) development of the
"probabilistic approach," which deals only with discrete probability
assessments. In this model, the group probability of event $\theta_k$ is
derived as


W 


m 

.» ri(eklpi,ek” 

i=l . 

.£,V\ 

j=l  i=l  J 

Rather than aggregating $p_i(\theta)$, the assessed probabilities, this
formulation aggregates $r_i(\theta_k \mid p_i(\theta_k))$, the value of
individual i's calibration function at $p_i(\theta_k)$. For example, if
for some assessor only 80 percent of the propositions assigned a
probability of .9 occur, then r would be .8 when p is .9. The remaining
terms reflect the lack of independence of the individual judgments and
the prior probabilities. These terms would often be very difficult to
determine, and in many instances, the $r_i$'s might also not be readily
available.

Two major assumptions are necessary to make either of these models
easily usable: independence among assessors and perfect calibration, i.e.,
$r_i = p_i$. Then, in the discrete case with uniform prior probabilities
either model reduces to

$$p_G(\theta_k) = \frac{\prod_{i=1}^{m} p_i(\theta_k)}{\sum_{j=1}^{n} \prod_{i=1}^{m} p_i(\theta_j)} \qquad (4)$$

If n, the number of hypotheses, is two, this model is equivalent to the
likelihood ratio form of Bayes' Theorem, with each individual's odds as
the likelihood ratio inputs. If the prior probabilities are not uniform,
in Morris's model, the prior probability would simply be treated
equivalently to the assessment of another individual, but in the Dalkey
model, the prior distribution enters into the calculations in a much more
complex manner (see Dalkey, 1975, pp. 252-255).
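
Equation (4) and its two-hypothesis odds form can be compared directly.
In the sketch below the two assessors and their probabilities are
hypothetical and the function name is introduced for illustration; both
routes assume independence, perfect calibration, and a uniform prior, as
stated above.

```python
from math import prod


def product_pool(distributions):
    """Normalized product of discrete distributions (equation 4)."""
    n = len(distributions[0])
    products = [prod(p[j] for p in distributions) for j in range(n)]
    total = sum(products)
    return [x / total for x in products]


# Hypothetical assessments for a two-hypothesis question.
experts = [[0.8, 0.2], [0.6, 0.4]]

pooled = product_pool(experts)

# Likelihood ratio form: multiply each individual's odds, then convert back.
group_odds = prod(p[0] / p[1] for p in experts)
from_odds = group_odds / (1.0 + group_odds)

print("equation (4):", pooled[0])
print("odds form:   ", from_odds)
```

Note that the pooled probability (about .86) is more extreme than either
input, a consequence of treating the assessments as independent evidence.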

With assessments of density functions, assumptions of independence
and perfect calibration, and the additional requirement that all
individually assessed densities be members of the same family of conjugate
distributions, Morris's model becomes the natural-conjugate model
suggested by Winkler (1968). Using conjugate distributions is not
necessary for the Morris model, but does greatly simplify the
mathematics.

Winkler generalized the conjugate model somewhat by allowing
each individual's distribution to be weighted. Differences in weights
should represent differences in the validity of the assessed
distributions, while the sum of the weights (in this case not required
to be one) should represent in some sense the independence of the
assessments. Thus, Winkler argued for the sum of the weights to be
between one and m, the number of assessors, because a sum of one
represents complete dependence, while a sum of m represents complete
independence. However, a type of dependence in which the entire set of
distributions provides more information than do the single distributions
by themselves might lead to sums greater than m, so such a restriction is
really not justified.

The idea of weighting the individual assessments in the discrete
case extends the model (eq. 4) to:

$$p_G(\theta_k) = \frac{\prod_{i=1}^{m} p_i(\theta_k)^{w_i}}{\sum_{j=1}^{n} \prod_{i=1}^{m} p_i(\theta_j)^{w_i}}$$

where $w_i$ is the weight assigned to individual i's assessment. If the
weights are required to sum to one, the group probability is then the
normalized weighted geometric mean of the individual assessments. This
model is then the multiplicative parallel to the linear combination
model, which is the weighted arithmetic mean. In considering the weighted
geometric and arithmetic means, it is useful to keep in mind Dalkey's
(1972) result showing that aggregation by addition generally destroys the
multiplicative properties of the probabilities, whereas aggregation by
multiplication destroys additive properties.
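
The effect of the weights in the multiplicative model can be sketched as
follows. The assessments and function name are hypothetical. With all
weights equal to one the result matches the unweighted product model of
equation (4); with weights summing to one it is the normalized weighted
geometric mean.

```python
from math import prod, sqrt


def geometric_pool(distributions, weights):
    """Normalized weighted product of discrete distributions."""
    n = len(distributions[0])
    terms = [prod(p[j] ** w for p, w in zip(distributions, weights))
             for j in range(n)]
    total = sum(terms)
    return [t / total for t in terms]


# Hypothetical assessments for a two-alternative question.
experts = [[0.8, 0.2], [0.6, 0.4]]

independent = geometric_pool(experts, weights=[1.0, 1.0])  # weights sum to m
geometric = geometric_pool(experts, weights=[0.5, 0.5])    # weights sum to one

print("weights summing to m:  ", independent[0])
print("weights summing to one:", geometric[0])
```

The sum of the weights controls how extreme the group probability
becomes, which mirrors the interpretation of that sum as a statement
about the independence of the assessments.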

Pari-mutuel model. An ingenious and appealing aggregation model
has been suggested by Eisenberg and Gale (1959) based on the pari-mutuel
betting system used at race tracks. The pari-mutuel betting system
provides a natural set of track or consensus odds (or equivalently,
probabilities). Eisenberg and Gale investigated the conditions under
which similar consensus probabilities could be explicitly determined from
a set of individual assessments. They formulated the problem as follows.
Suppose there are m individuals and n mutually exclusive and exhaustive
events, and each individual i has amount $b_i$ to bet, with the $b_i$'s
normalized to sum to one. Each individual i bets $\beta_{ij}$ on event j,

$$\sum_{j=1}^{n} \beta_{ij} = b_i,$$

so as to maximize his or her subjective expected value, and the final
consensus probabilities are proportional to the total amount bet on each
event. That is,

$$p_G(\theta_j) = \sum_{i=1}^{m} \beta_{ij},$$

where equality holds because of the normalization of the $b_i$'s.
Individual i will maximize expected value by betting only on those events
for which $p_i(\theta_j)/p_G(\theta_j)$ is maximum.

At this point the reasoning appears to be circular: individuals
cannot bet without knowing $p_G$, and $p_G$ cannot be determined until the
bets are made. Eisenberg and Gale do not give a solution to this
circularity. Rather, they simply prove that a set of bets and a unique set
of consensus probabilities exist that are consistent with this model. The
consensus probabilities are

$$p_G(\theta_j) = \max_i \frac{b_i \, p_i(\theta_j)}{\sum_{k=1}^{n} p_i(\theta_k) \, x_{ik}}$$

The values $x_{ij}$ are the values that maximize the function

$$F(x_{11}, \ldots, x_{mn}) = \sum_{i=1}^{m} b_i \log \sum_{j=1}^{n} p_i(\theta_j) \, x_{ij}$$

with $x_{ij} \ge 0$ and $\sum_{i=1}^{m} x_{ij} = 1$, for all i and j.


Norvig (1967) has proved the same result with a more intuitively
appealing mathematical approach. He formulated the problem as an
interactive process in which individuals place bets which lead to
consensus probabilities, which then allow individuals to place new bets,
etc. The consensus probabilities will then converge on the same
probabilities specified in the Eisenberg-Gale model.
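
An iterative betting process of this general kind can be sketched in a
few lines. The update rule used here is an assumption: it is the
"proportional response" rule from the later literature on market
equilibria, in which each individual re-bets in proportion to the
subjective value of the claims bought in the previous round. It is a
stand-in chosen for simplicity, not Norvig's or Eisenberg and Gale's own
procedure, and the beliefs and budgets are hypothetical.

```python
def pari_mutuel_consensus(beliefs, budgets, iterations=500):
    """Iterate bets and return the total bet on each event as consensus."""
    m, n = len(beliefs), len(beliefs[0])
    # Start with every individual spreading the budget evenly over events.
    bets = [[budgets[i] / n] * n for i in range(m)]
    for _ in range(iterations):
        totals = [sum(bets[i][j] for i in range(m)) for j in range(n)]
        new_bets = []
        for i in range(m):
            # Subjective value of individual i's share of each event's pool.
            worth = [beliefs[i][j] * bets[i][j] / totals[j] if totals[j] > 0 else 0.0
                     for j in range(n)]
            scale = budgets[i] / sum(worth)
            new_bets.append([w * scale for w in worth])
        bets = new_bets
    return [sum(bets[i][j] for i in range(m)) for j in range(n)]


# Two hypothetical individuals with equal budgets and opposed opinions.
opposed = pari_mutuel_consensus([[0.9, 0.1], [0.2, 0.8]], [0.5, 0.5])

# Two hypothetical individuals who agree; consensus should equal their opinion.
agreed = pari_mutuel_consensus([[0.7, 0.3], [0.7, 0.3]], [0.5, 0.5])

print("opposed opinions:", opposed)
print("shared opinion:  ", agreed)
```

In the first case each individual ends up betting only on the event he or
she favors relative to the consensus, so the consensus reflects the
budgets (.5 and .5) rather than the strength of the opinions. This
illustrates how the amounts bet act as weights in the pari-mutuel model.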

Weighting procedures. Most of the mathematical aggregation
models allow the individual assessments to be differentially weighted.
Even the pari-mutuel model, although not explicitly referring to weights,
allows weighting via the amount each individual can bet. Therefore, the
specification of weights is a necessary part of the use of these models.
Several procedures have been suggested and empirically tested with linear
combination models, including both theoretically developed procedures
and strictly ad hoc methods. In empirical tests, the theoretical
procedures have not shown any superiority to ad hoc methods of assigning
weights. An informal test (Hogarth, 1977) of weights derived using
the DeGroot (1974) model showed it led to predictions that were slightly
inferior to those of a simple average (equal weights).

Roberts (1965) has suggested another weighting procedure based
on the predictive probability of previous assessments. However, because
the weights for most individuals will rapidly approach zero, this
procedure has proved to be impractical (Winkler, 1971).

The more ad hoc weighting procedures, usually based on past
performance, self-ratings, or ratings by others, have received
considerable attention. Stael von Holstein (1972) compared several
weighting procedures based on prior performance and found little or no
difference among them. Similar results have been obtained with
self-ratings and ratings by others (Gough, 1975; Rowse, Gustafson, and
Ludke, 1974; Stael von Holstein, 1971; Winkler, 1971).

These  results  are  not  surprising  given  the  "flatness"  of  linear 
models  (von  Winterfeldt  and  Edwards,  1973).  This  flatness  ensures  that 
relatively  large  changes  in  weights  will  produce  only  small  changes  in 
the  output  of  the  model.  Since  both  the  aggregation  procedure  (linear 
combination)  and  the  evaluation  procedure  (proper  scoring  rules)  are 
linear models, flatness is doubly ensured. Whether or not this
insensitivity to weights also holds for nonlinear aggregation models and
other types of evaluations remains to be investigated.
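
A small numerical sketch of this flatness follows. The assessments and
weights are hypothetical; the point is only that a weighted average can
never move outside the range of the individual assessments, so even a
large change in weights shifts the group probability by a small amount
when the individuals do not disagree much.

```python
def linear_pool(probs, weights):
    """Weighted average of probabilities; weights are normalized first."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total


probs = [0.60, 0.70, 0.75, 0.80]            # hypothetical assessments
equal = linear_pool(probs, [1, 1, 1, 1])    # simple average
skewed = linear_pool(probs, [4, 1, 1, 1])   # first assessor weighted 4x
shift = abs(equal - skewed)
```

Here quadrupling one individual's weight moves the group probability by
less than .05.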

Behavioral  Interaction 


An alternative to mathematical aggregation of individual
probabilities is some kind of behavioral interaction, which can be used
either in conjunction with mathematical aggregation or by itself.
Interaction here refers to any form of communication or transfer of
information and ideas among the individuals making the assessments, and
therefore is not limited to face-to-face discussions.

The  most  obvious  reason  for  allowing  interaction  among  group 
members  is  that  each  may  have  information  that  is  useful  to  the  others 
in making their assessments. By sharing this information, the assessment
of each individual, and therefore the group assessment, may be improved.
This need not happen, because the shared information may in fact produce
worse assessments; but if the potential can be exploited, the
interaction should be beneficial. In fact, consensus probabilistic
judgments determined through interaction have been shown empirically to
be superior to individual judgments (Goodman, 1972; Stael von Holstein,
1971).

Social  psychological  research  suggests  some  other  reasons  that 
favor interacting groups in a wide variety of judgmental tasks.
Interaction is likely to make group members feel more responsible for the
group judgment and, therefore, increase their motivation. This also
has a practical beneficial side effect: the group members are more
likely to accept a judgment arrived at in this manner as the basis for
making a decision (Collins and Guetzkow, 1964; Davis, 1969).

Given  these  potential  positive  benefits  of  behavioral  interaction, 
considerable  interest  has  developed  in  finding  ways  to  take  advantage 
of  them  without  the  group  being  exposed  to  the  known  negative  aspects 
of  interaction  such  as  dominant  individuals  and  pressure  for  conformity 
that typically accompany uncontrolled interaction. In particular, two
procedures that control interaction have been developed and widely
utilized in a variety of contexts: Delphi, developed by Dalkey and Helmer
at The Rand Corporation; and the Nominal-Group-Technique (NGT), developed
by Delbecq and Van de Ven at the University of Wisconsin. Both
procedures rely on controlled interaction, and neither actually leads to
a group consensus, thereby necessitating the use of some type of
mathematical aggregation. The procedural details and empirical support
for these methods are reviewed in the following subsections.

Delphi.  Delphi  was  first  used  in  1951  to  elicit  expert  judgments 
about  the  number  of  A-bombs  needed  to  reduce  U.S.  munitions  output  to  a 
certain level (Dalkey and Helmer, 1963). Since then it has achieved
widespread use, particularly in industry for predicting technological
development (Linstone and Turoff, 1975; Sackman, 1974). Many different
procedures  have  been  used  under  the  name  "Delphi,"  but  as  originally 
conceived,  Delphi  includes  three  basic  features:  (1)  anonymity  of  group 
members;  (2)  iteration  of  responses  with  controlled  feedback  between 
iterations;  and  (3)  statistical  aggregation  (unspecified  as  to  type)  of 
individual  judgments  to  form  the  group  response  (Dalkey,  1969b) . 

These  characteristics  are  designed  to  reduce  some  of  the  poten- 
tial problems  associated  with  face-to-face  discussion  groups.  The 
anonymity  ensures  that  no  individuals  can  dominate  the  group  because 
of  status.  Iteration  and  controlled  feedback  allow  the  exchange  of 
information  without  the  value  of  the  information  being  affected  by  its 
source. Finally, the statistical group response lessens the pressure
for  conformity  and  takes  advantage  of  the  error  variance  reduction  of 
statistical  aggregation. 

The validity of responses obtained using Delphi was studied in
a series of experiments at The Rand Corporation (Dalkey, 1969a, 1969b;
Dalkey, Brown and Cochran, 1970a, 1970b). In the only study that
compared Delphi responses with the consensus of face-to-face discussion
groups (Dalkey, 1969b), Delphi yielded more accurate answers on 13 of
20 questions, marginal support at best for Delphi. Additional support
came from a second part of the study in which groups used Delphi between
rounds one and two of responses, and face-to-face discussion between
rounds two and three. There was slightly more improvement between rounds
one and two, but again this support is quite weak given the small
difference and the obvious design flaws. The Delphi procedure does lead
to  improved  judgments  with  successive  rounds,  but  the  convergence  of 
judgments  is  much  larger.  In  fact,  generally  the  judgments  converge 
much  more  than  is  justified  by  the  improvement  (Dalkey,  1969a,  1969b). 

The  use  of  Delphi  as  a technique  for  generating  quantitative 
assessments  of  unknown  quantities  from  multiple  experts  seems  to  be  much 
more  extensive  than  can  be  justified  by  the  empirical  research  (Sackman, 
1974). Several features of Delphi can be questioned: the multiple
iterations apparently produce more convergence than is justified, and
the anonymity of respondents suppresses a potentially important feature
of the feedback information, namely, its source.

Clearly,  the  value  of  Delphi  has  not  been  firmly  established, 
particularly  as  a tool  for  assessing  group  probabilities.  There  have 
been enough positive results, however, to justify further
investigations. A few studies have used the Delphi method to assess group
probabilities. They will be discussed following the presentation of the
Nominal-Group-Technique.


Nominal-Group-Technique. Van de Ven and Delbecq (1971) reviewed
the literature on the effectiveness of nominal groups (groups with no
spontaneous interaction) versus interacting groups on problem-solving
and decision-making tasks, and concluded that a process combining the
attributes of these two processes should be more effective than either
alone. On this basis, they developed and tested the NGT. The specific
procedure, described in Delbecq, Van de Ven, and Gustafson (1975),
includes (1) silent judgments by individual group members in the presence
of the group; (2) presentation to the group without discussion of all
individual judgments; (3) group discussion for clarification and
evaluation controlled by a group leader to prevent dominance and to focus
on relevant issues; (4) individual reconsideration of judgments; and (5)
mathematical aggregation of final individual judgments.

Thus,  like  the  Delphi  method,  NGT  may  reduce  pressure  for 
conformity  by  not  forcing  a consensus.  The  controlled  discussion  also 
reduces  the  chance  for  dominance  by  individuals,  although  perhaps  not 
to  the  extent  Delphi's  anonymity  does.  Both  procedures  eliminate  the 
need  for  the  group  to  provide  structure  since  it  is  implicit  in  the 
procedure.  The  primary  differences  in  Delphi  and  NGT  are  that  NGT  re- 
quires that  group  members  actually  be  together  physically  and  allows 
face-to-face  discussion.  NGT  also  provides  knowledge  of  the  source  of 
any  and  all  information.  Additionally,  NGT  requires  an  active  leader. 
Delbecq et al. (1975) discuss the advantages and disadvantages of this
type  of  leadership  role. 

Much of the empirical support for the NGT comes from a
problem-solving study with rather weak evaluation criteria (Van de Ven
and Delbecq, 1974). Groups using Delphi, NGT, and uncontrolled interaction
were compared on the number of alternatives generated and the perceived
satisfaction of group members. NGT clearly led to more satisfaction,
while NGT and Delphi groups both produced more alternatives than the
interacting groups. Neither of these measures has much relevance to
the quality of the group judgments, but the satisfaction may be
practically important.

Experimental comparisons with probabilistic judgments. Although
neither Delphi nor NGT was developed for assessing probabilities, both
obviously could be applied in this capacity. In fact, three studies have
specifically  compared  these  procedures  with  interacting  groups  and 
mathematical  aggregation  without  any  interaction.  Gustafson,  Shukla, 
Delbecq,  and  Walster  (1973)  compared  groups  making  judgments  about  the 
likelihood  ratios  of  male  versus  female  given  certain  heights.  Four 
types  of  groups  were  used:  mathematical  aggregation  without  inter- 
action; NGT;  Delphi;  and  modified  interacting  groups.  The  modification 
to  the  interacting  groups  was  that  no  actual  consensus  was  required 
prior  to  individual  judgments  after  the  interaction.  Thus,  interacting 
groups  differed  from  NGT  groups  only  in  that  NGT  groups  made  individual 
judgments  before  the  interaction.  Geometric  means  were  used  to  aggregate 
the  individual  likelihood  ratio  judgments.  Using  the  average  deviations 
of  the  group  judgments  from  the  true  likelihood  ratios,  NGT  groups  pro- 
duced the  best  assessments  and  Delphi  groups,  the  worst. 

A study  by  Gough  (1975)  used  the  same  four  types  of  groups  as 
used by Gustafson et al., with the exception that the interacting groups
made individual assessments prior to interaction and actually had to
reach a consensus during the interaction. The assessments were five
fractiles of the individuals' cumulative subjective probability
distribution for general information questions, and a linear aggregation
model was used. A quadratic proper scoring rule was applied to evaluate
the  probability  distributions.  Although  Gough's  results  indicated 
that  NGT  groups  produced  the  best  assessments,  the  differences  were 
quite  small  and  probably  did  not  justify  his  conclusions  favoring  the 
NGT. 

The  third  study  (Fischer,  1975)  used  the  same  types  of  groups 
as  Gough  with  a different  type  of  assessment.  Subjects  were  asked  to 
assess the probability of freshmen GPAs falling into four mutually
exclusive and exhaustive categories given information about gender, high
school GPA, SAT math, and SAT verbal scores. Fischer's evaluation method,
a logarithmic proper scoring rule, was similar to Gough's, as were his
results. There was virtually no difference among the groups. Fischer
attributes  much  of  the  difference  between  his  results  and  those  of 
Gustafson  et  al.  to  the  dependent  variable  used  to  evaluate  the  assess- 
ments. His  basic  argument  is  that  large  differences  in  likelihood 
ratios  may  be  only  small  differences  when  transformed  into  probabilities, 
particularly  at  the  extreme  ends  of  the  probability  scale.  Thus,  it 
appears  results  may  very  much  depend  on  the  way  in  which  they  are 
evaluated. 
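
Fischer's argument can be checked with a short calculation. The sketch
assumes equal prior odds for the two hypotheses (an assumption made here
only for illustration) and converts likelihood ratios to probabilities;
the function name and the ratios are hypothetical.

```python
def likelihood_ratio_to_probability(lr, prior_odds=1.0):
    """Probability implied by a likelihood ratio and the prior odds."""
    odds = lr * prior_odds
    return odds / (1.0 + odds)


# One judgment is twice the other on the likelihood-ratio scale ...
p_50 = likelihood_ratio_to_probability(50.0)
p_100 = likelihood_ratio_to_probability(100.0)
# ... but the two differ by less than .01 on the probability scale.
difference = p_100 - p_50
```

A deviation measure computed on likelihood ratios therefore treats these
two judgments as far apart, while a scoring rule applied to the
corresponding probabilities treats them as nearly identical.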


AN EXPERIMENTAL COMPARISON AND EVALUATION


As suggested in the Introduction, several questions about how
group probabilities should be assessed need to be answered empirically.
The previous section outlining the mathematical and behavioral
interaction approaches to assessing group probabilities and related
literature indicates that these questions have yet to be answered. This
experiment attempts to answer these questions.

The  first  question  is  whether  interaction  of  some  kind  will 
improve  the  group  probabilities  compared  with  probabilities  derived  by 
mathematically  aggregating  the  individual  assessments.  If  interaction 
does  improve  assessments,  what  type  of  interaction  allows  the  most 
improvement?  In  this  study  four  types  of  interaction  were  used,  along 
with  a no  interaction  condition.  They  represent  the  interaction  processes 
typically  found  in  previous  research:  Delphi,  NGT,  and  interacting 
groups forced to reach a consensus (hereafter called consensus or CON
groups), along with a fourth process (MIX) that is somewhat a mixture
of Delphi and NGT. This process, like NGT, has individuals make judgments
and present them to the group, but allows only presentation of specific
reasons for the judgment without open discussion. In this respect, it
is more similar to Delphi. These interaction processes represent a
continuum with respect to the latitude the groups have for interacting,
ranging from none to complete freedom.

Another issue investigated is the differences in group
probabilities caused by use of different mathematical aggregation models.
Because different models can be used depending on whether the assessed
probabilities are discrete or continuous, both types of assessments
were obtained. The basic aggregation models that were used included
the linear combination model for both discrete and continuous
probabilities; the conjugate model with weights summing to one and to m
(the number of group members) for continuous probabilities; the discrete
counterparts of the conjugate model (weighted normalized geometric mean
and aggregation by likelihood ratios); and the pari-mutuel model for
discrete probabilities. Additionally, three sets of weights were used
with each model that allows for differential weighting: weights obtained
from the DeGroot (1974) model; weights reflecting each individual's
self-rating relative to the self-ratings of other individuals; and equal
weights. Group probabilities derived by aggregating individual
probabilities with these models can also be compared with the consensus
probabilities decided upon by CON groups.

An  additional  product  of  this  study  is  a comparison  of  indivi- 
dual and  group  probability  assessments  using  primarily  extremeness, 
calibration,  and  proper  scoring  rules.  A quadratic  scoring  rule  was 
used  for  discrete  assessments.  Continuous  assessments  were  evaluated 
with  a linear  transformation  of  the  continuous  ranked  probability  score, 

$$S^* = (S_u - S)/S_u,$$

where S is the usual score (eq. 2) and $S_u$ is the score for a
uniform  distribution;  i.e.  p(0)  ■ 6.  This  permissible  transformation 
makes the scores easier to interpret since S* does not depend on the
true value as S does. The range of S* is from -4.0 to 1.0 with a
uniform  distribution  receiving  a score  of  0.0. 
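
The transformation can be written out directly. In this sketch `s` is
the usual score for the assessed distribution and `s_uniform` is the
score the uniform distribution would receive on the same question; both
argument names and the numerical values are hypothetical.

```python
def transformed_score(s, s_uniform):
    """S* = (S_u - S) / S_u: zero for an assessment that scores the same
    as the uniform distribution, one when S itself is zero."""
    return (s_uniform - s) / s_uniform


same_as_uniform = transformed_score(0.2, 0.2)
best_possible = transformed_score(0.0, 0.2)
five_times_worse = transformed_score(1.0, 0.2)
```

An assessment whose usual score is five times that of the uniform
distribution lands at the lower end of the stated range.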

Experimental  Method 

Subjects. Eleven four-person groups were used. Ten groups
participated in the assessment of discrete probabilities, but one of
these groups was unavailable to assess continuous probabilities and was
therefore replaced. This causes no problem in data analyses since the
data from the two types of assessments are analyzed separately. The
subjects were predominantly graduate students at the University of
Southern California or their friends. Each subject was familiar with
the other three members of the group. Subjects were paid $20 plus
bonuses based on evaluations of some of their responses with the proper
scoring rules for each of the two sessions, bringing total payment to
approximately $5 to $6 per hour.

Stimuli.  For  the  discrete  assessments,  the  stimuli  were  100 
two-alternative  general  information  questions  randomly  sampled  from  a 
collection  of  about  700  such  questions.1  These  questions  were  randomly 
divided  into  five  sets  of  20  questions. 

The continuous stimuli were general information questions about
percentages. A set of these questions was developed with five true values
falling into each range of 5 percent from 10 percent to 40 percent and
from 60 percent to 90 percent and two true values in each 5 percent
range between 40 percent and 60 percent. These questions were randomly
assigned to five sets of questions so that each set had one true value
in each 5 percent range from 10 percent to 40 percent and from 60 percent
to 90 percent and two true values in each 5 percent range between 40
percent and 60 percent. This was to ensure that differences in the
quality of probabilities assessed for the different question sets were
not due to the true values of the questions. The very extreme
percentages were avoided because of the large biases usually found in
assessed distributions for these questions (Fujii, Seaver, and Edwards,
1977).

1 I would like to thank Sarah Lichtenstein for making these
questions available.

Procedure.  Each  group  of  subjects  participated  in  two  sessions: 
the  first  assessing  discrete  probabilities  and  the  second,  continuous 
probabilities.  Sessions  lasted  from  three  to  four  and  a half  hours  with 
the  continuous  assessment  session  taking  about  an  hour  longer  than  the 
first  session  because  additional  training  was  needed.  Each  group  answered 
a different  set  of  questions  in  each  of  the  five  interaction  conditions. 
The  question  sets  and  the  order  of  interaction  conditions  were  balanced 
in  a 5 x 5 Greco-Latin  square. 
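
One standard way to build such a square is sketched below. The
construction and the interpretation of its entries (for example, rows as
groups, columns as order of presentation, and the two symbols in each
cell as interaction condition and question set) are assumptions for
illustration; the source does not give the square that was actually
used.

```python
def graeco_latin_square(n=5):
    """Cell (r, c) holds the pair ((r + c) mod n, (r + 2c) mod n).
    For n = 5 the two component squares are orthogonal Latin squares."""
    return [[((r + c) % n, (r + 2 * c) % n) for c in range(n)]
            for r in range(n)]


square = graeco_latin_square()
# Every one of the 25 (first symbol, second symbol) pairs occurs once.
all_pairs = {pair for row in square for pair in row}
```

Because each symbol of either kind appears once in every row and every
column, and every pairing of the two kinds appears exactly once, effects
of question set and of order are balanced across the interaction
conditions.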

For  the  discrete  assessments,  subjects  were  required  to  choose 
the  answer  they  thought  was  most  likely  to  be  correct  and  then  assess 
the  probability  (p  > .5)  that  the  choice  was  correct.  Also,  for  each 
question  they  were  instructed  to  assign  weights  to  each  group  member 
reflecting  their  belief  about  how  much  each  group  member's  opinion  should 
contribute to the "group opinion." These weights were to reflect
subjects' prior beliefs about the expertise of the group members with respect
to  the  question  under  consideration.  Each  individual  whose  opinion 
should  contribute  nothing  was  assigned  a weight  of  zero.  Of  the  re- 
maining group  members,  a weight  of  10  should  be  assigned  to  those 
whose  opinion  should  contribute  the  least.  Any  remaining  individuals 
should  be  assigned  weights  reflecting  their  contribution  relative 
to  those  receiving  weights  of  10.  For  example,  if  another  individual's 
opinion  should  contribute  five  times  as  much,  that  individual  would 
receive a weight of 50. Weights were assigned for each question
during both the initial and final probability assessments in all
interaction conditions.


Subjects were then given a sheet of paper showing the quadratic scoring
rule that would be used to evaluate their assessments. In addition to a
fixed payment of $20, subjects could win or lose money based on applying
the scoring rule to judgments on two randomly selected questions from
each set of 20. The paper included the amount to be won or lost for
probabilities between .5 and 1.0 in steps of .05 plus .99. The quadratic
scoring rule (eq. 1) was linearly transformed so that any assessment of
.5 would mean nothing won or lost, while an assessment of 1.0 would
result in a win of one dollar if the choice was correct, or a loss of
three dollars if the choice was wrong. Four sample questions were
answered by each subject and the answers to these questions were scored
to illustrate the scoring rule.
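
The payoff sheet can be reproduced under one assumption: that the score
is a quadratic function of the probability r assigned to the answer that
turned out to be correct. The three stated payoffs (nothing at .5, a win
of one dollar at 1.0 if correct, a loss of three dollars at 1.0 if wrong)
are then all satisfied by the payoff 1 - 4(1 - r)^2 dollars. The function
name and this algebraic form are reconstructions, not taken from eq. 1.

```python
def payoff_dollars(p, correct):
    """Payoff for stating probability p (0.5 <= p <= 1.0) that the
    chosen answer is correct."""
    # r is the probability given to the answer that was actually right.
    r = p if correct else 1.0 - p
    return 1.0 - 4.0 * (1.0 - r) ** 2


# Probabilities listed on the sheet: .50 to .95 in steps of .05, plus
# .99 and 1.0.  Each row: (probability, win if right, loss if wrong).
levels = [k / 100 for k in list(range(50, 100, 5)) + [99, 100]]
sheet = [(p, payoff_dollars(p, True), payoff_dollars(p, False))
         for p in levels]
```

The asymmetry is visible in the rows of `sheet`: raising a stated
probability toward 1.0 adds little to the possible win but rapidly
increases the possible loss, which is what discourages overstatement.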

The  procedure  for  the  initial  individual  assessments  was  the 
same  for  each  interaction  condition.  The  subjects  answered  each  of  the 
20 questions without any discussion among themselves. After all group
members  had  completed  these  questions,  the  procedure  varied  depending  on 
the  interaction  condition.  Table  1 shows  the  major  differences  in  the 
interaction  conditions. 

TABLE 1

MAJOR DIFFERENCES IN TYPES OF INTERACTION

              Reconsider with    Knowledge of  Verbal
Type of       Information about  Judgment      Information  Uncontrolled  Consensus
Interaction   Other Judgments    Source        Exchange     Discussion    Necessary

None
Delphi        Yes
MIX           Yes                Yes           Yes
NGT           Yes                Yes           Yes          Yes
CON           Yes                Yes           Yes          Yes           Yes

In the no interaction condition, the subjects were simply told
the answers and scored two pre-selected questions.

In the Delphi condition, the experimenter collected the
assessments and explained that the subjects would have two subsequent chances
to  reassess  their  probabilities,  each  time  with  additional  information 
about  the  assessments  of  the  other  group  members.  For  each  question 
the  subjects  were  given  the  four  assessments  of  the  group  without  any 
information  about  who  made  which  assessment.  On  the  basis  of  this  in- 
formation they  reconsidered  their  judgments.  In  addition,  they  were 
instructed  to  write  any  information  that  might  be  useful  to  other  group 
members  in  space  provided  on  the  answer  sheets.  In  particular,  if  some- 
one's judgments  differed  radically  from  other  group  members,  that  person 
should  attempt  to  explain  the  reasoning  behind  the  judgment.  After  all 
20  questions  were  again  answered,  the  same  process  was  repeated  with 
the  feedback,  including  any  written  information  provided  by  the  subjects. 
After  the  final  set  of  responses  was  completed,  the  answers  were  given 
and  two  questions  were  scored  from  each  of  the  initial  and  final 
assessments . 

In  the  MIX  condition,  each  group  member  presented  his  or  her 
assessment  for  the  question  under  consideration  to  the  group  verbally. 
After  each  assessment  had  been  presented,  any  group  member  was  allowed 
to  state  any  reasons  underlying  the  assessment  or  any  information  that 
might  be  useful  to  other  group  members.  Each  individual  then  reconsidered 
the  assessment  for  that  question.  After  all  questions  had  been  considered 
a second time, the answers to all questions were given and two
assessments from each of the initial and subsequent assessments were
scored for pay.

The NGT groups were the same as the MIX groups except that, after
presentation of the individual assessments, a general face-to-face
discussion was allowed with only the restriction that it be relevant to
the question under consideration.

The  CON  groups  differed  from  the  NGT  groups  in  that  the  presen- 
tation of  individual  judgments  was  not  required  and  the  groups  had  to 
reach  consensus  (agreement)  about  the  assessment. 

The  second  session,  in  which  continuous  probabilities  were 
assessed,  was  similar  in  most  respects  to  the  first  except  considerably 
more training for the assessments was provided. For these questions
subjects were requested to assess two parameters of a beta distribution
representing their opinion about the possible answers to questions
involving percentages. Rather than asking for $\alpha$ and $\beta$, the
usual beta parameters, the parameters $m = (\alpha - 1)/(\alpha + \beta - 2)$,
the mode, and $n = \alpha + \beta - 2$, which reflects the tightness of
the distribution and can be considered as a sample size, were assessed.
To teach the subjects
the  correspondence  between  these  parameters  and  the  actual  shape  of  the 
probability  distributions,  each  subject  was  given  a book  containing  graphs 
of  the  density  and  cumulative  distribution  functions  of  beta  distribu- 
tions with  values  of  m beginning  at  .05  and  increasing  by  steps  of  .05 
to  .95,  and  values  of  n equal  to  0,  2,  5,  10,  15,  20,  25,  30,  50,  75,  and 
100  for  each  m.  Each  of  the  graphs  also  included  the  corresponding  numeri- 
cal quantities  of  density  and  cumulative  probability  for  each  .05  incre- 
ment. Subjects  kept  these  books  for  reference  throughout  the  session. 
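
The correspondence between the assessed parameters and the usual beta
parameters follows by solving the two definitions for the usual
parameters. The sketch below is an illustration only; the function names
are hypothetical.

```python
def beta_from_mode_and_n(m, n):
    """Usual beta parameters from the mode m and the 'sample size' n."""
    alpha = m * n + 1.0
    beta = (1.0 - m) * n + 1.0
    return alpha, beta


def mode_and_n_from_beta(alpha, beta):
    """Inverse mapping; the mode is undefined when n = 0."""
    n = alpha + beta - 2.0
    m = (alpha - 1.0) / n
    return m, n


alpha, beta = beta_from_mode_and_n(0.25, 8)
uniform = beta_from_mode_and_n(0.30, 0)
```

With n = 0 both usual parameters equal one regardless of m, which is the
uniform distribution; larger n concentrates the distribution around the
mode.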

After  the  meaning  of  the  graphs  and  the  correspondence  between 
the  parameters  and  the  shape  of  the  distributions  was  explained,  a test 
was made to ensure that the subjects knew this correspondence. Subjects
were presented graphs of various beta densities and cumulative
distributions together and asked to estimate the parameters of the
distributions. Actually, only the parameter n was estimated since
subjects had no trouble with the correspondence between m and the
distributions. These graphs were presented to the subjects individually
until 12 consecutive estimates (each subject three times) of n were
between 2/3 and 3/2 of the true value. The total number of graphs
presented ranged from 89 to 208 for the various groups.

After  the  training,  subjects  were  instructed  about  the  scoring 
rule  to  be  used  for  these  assessments,  given  four  practice  questions  and 
answers,  reminded  of  the  procedure  for  assessing  weights,  and  began  the 
task  with  the  interaction  conditions.  Following  the  completion  of  the 
second  session,  subjects  were  questioned  as  to  which  procedure  they  would 
prefer  to  use  if  they  were  in  a real  decision  making  group  which  needed 
to  determine  some  relevant  probability. 

Results 

Discrete  assessments.  The  average  quadratic  scores  of  various 
aggregation  models  both  before  any  interaction  and  after  each  of  the 
types  of  interaction  are  presented  in  Table  2(a),  along  with  the  average 
individual scores and the average score of the actual consensus
assessments. The aggregation models are the linear model (eq. 3), the
geometric mean model (eq. 5), and the likelihood ratio model (eq. 4).
The three weighting procedures are equal, DeGroot (1974), and
self-rating, derived by first normalizing the weights assigned by each
individual to sum to one and then again normalizing the (normalized)
weights individuals assigned to themselves:

$$w_i = \frac{w_{ii} / \sum_{j} w_{ij}}{\sum_{k} \left( w_{kk} / \sum_{j} w_{kj} \right)},$$

where $w_i$ is the derived self-weight for individual i and $w_{ij}$ is the
weight assigned by individual i to individual j.
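
The two normalization steps can be sketched as follows. The matrix of
assigned weights is hypothetical and uses the 0, 10, 50 style of weights
described in the procedure; the function name is an assumption.

```python
def self_rating_weights(assigned):
    """assigned[i][j] is the weight individual i gave to individual j."""
    # Step 1: normalize each individual's assigned weights to sum to one
    # and keep the share each individual gave to himself or herself.
    own_share = [row[i] / sum(row) for i, row in enumerate(assigned)]
    # Step 2: normalize those self-shares across the group.
    total = sum(own_share)
    return [s / total for s in own_share]


assigned = [
    [50, 10, 10, 30],   # individual 1 rates self highest
    [10, 10, 20, 10],
    [0, 10, 10, 20],    # a zero means "should contribute nothing"
    [10, 20, 10, 10],
]
weights = self_rating_weights(assigned)
```

The first normalization removes differences in the scale individuals
happened to use, so that only each person's relative confidence in his
or her own opinion enters the final weights.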

The most notable result in Table 2(a) is that the likelihood ratio
model does quite poorly. The linear and geometric mean models differ only
slightly, as do the weighting procedures. Also, interaction does not
generally seem to have much effect on the group scores, although the NGT
scores tend to be somewhat higher; it does, however, increase the
individual scores. An
analysis  of  variance  with  five  repeated  measures  factors:  interaction 
type,  questions,  repetition  (before  or  after  interaction),  aggregation 
model  (only  linear  and  geometric  mean),  and  weights,  generally  confirmed 
these conclusions. Other than the questions factor, which is of little
interest here, no main effects were significant and only two interaction
terms were significant: the aggregation model by weights interaction,
F(2,18) = 11.7, p ≤ .001, and the repetition by aggregation model by
weights interaction, F(2,18) = 8.48, p ≤ .003.
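
For a two-alternative question the three models can be sketched as
below. Equations 3 to 5 are not reproduced in this section, so the forms
used here are assumptions: a weighted average, a weighted geometric mean
renormalized over the two alternatives, and a product of the individual
odds (which corresponds to the geometric form with every weight equal to
one, so that the weights sum to m). Function names and the example
probabilities are hypothetical, and probabilities of exactly 1.0 would
need special handling.

```python
import math


def linear(probs, weights):
    """Weighted average; weights are assumed to sum to one."""
    return sum(w * p for w, p in zip(weights, probs))


def normalized_geometric_mean(probs, weights):
    """Weighted geometric means of p and 1 - p, renormalized to sum to one."""
    g_yes = math.prod(p ** w for p, w in zip(probs, weights))
    g_no = math.prod((1 - p) ** w for p, w in zip(probs, weights))
    return g_yes / (g_yes + g_no)


def likelihood_ratio(probs):
    """Multiply the individual odds and convert back to a probability."""
    odds = math.prod(p / (1 - p) for p in probs)
    return odds / (1 + odds)


probs = [0.6, 0.8]
lin = linear(probs, [0.5, 0.5])
geo = normalized_geometric_mean(probs, [0.5, 0.5])
lr = likelihood_ratio(probs)
```

With these two assessments the likelihood ratio form gives a group
probability above either individual's, the geometric form is slightly
more extreme than the linear form, and the linear form is the simple
average.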

Although  the  evaluation  with  the  scoring  rule  shows  little  differ- 
ence among  the  group  probabilities , other  characteristics  show  more  dis- 
tinct effects.  Table  2 Cb)  shows  the  average  probabilities  assigned  to 
the  correct  response,  a measure  of  the  extremeness  of  the  assessments. 

The  group  probabilities  are  more  extreme  than  the  individual  probabil- 
ities, the  likelihood  ratio  model  produces  the  most  extreme  probabilities 
and  interaction  leads  to  more  extreme  probabilities.  An  analysis  of 
variance  confirmed  the  effects  apparent  in  the  means  showing  the  proba- 
bilities to  be  more  extreme  after  interaction,  F(l,9)=30.5,  p <_.001, 
more  extreme  with  the  geometric  mean  than  the  linear  model,  F (1,9) -29. 2, 
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TABLE  2(a) 

AVERAGE  QUADRATIC  SCORES 


Linear 

? Geometric  Mean 

Likeli- 

hood 

Ratio 

Actual 

Consen- 

sus 

Indi- 

vidual 

Equal 

m 

De- 

Groot 

Equal 

De- 

Groot 

Before 

.562 

.565 

.572 

.570 

.551 

.568 

.495 

.494 

After 

.577 

.569 

.573 

.557 

.545 

.549 

.449 

.556 

.541 

Delphi 

.573 

.557 

.577 

.541 

.529 

.542 

.447 

.529 

MIX 

.565 

.558 

.554 

.547 

.528 

.527 

.429 

.526 

NGT 

.599 

.595 

.599 

.584 

.576 

.582 

.465 

.556 

CON 

.572 

.564 

.562 

.555 

.547 

.544 

.454 

.551 

TABLE 2(b)

AVERAGE PROBABILITY ASSIGNED TO CORRECT ANSWER

               Linear                Geometric Mean        Likeli-  Actual
         Equal  Self-   DeGroot   Equal  Self-   DeGroot   hood     Consen-  Indi-
                rating                   rating            Ratio    sus      vidual
Before   .552   .575    [.590, .604, .601 legible;         .636              .552
                        one entry lost, columns unclear]
After    .613   .623    [?]       .631   .634    .633      .655     .635     .613
Delphi   .605   .611    .614      .621   .621    .626      .655              .605
MIX      .594   .607    .602      .617   .620    .616      .638              .594
NGT      .627   .639    .638      .648   .653    .654      .667              .627
CON      .626   .633    .627      .637   .641    .635      .660              .626

[?] = entry illegible in the source.


p ≤ .001, and less extreme with equal weights, F(2,18) = 12.5, p ≤ .001.
In addition, the three two-way interactions among these three factors
were significant. However, neither the main effect due to interaction
type nor any of the interactions with that factor was significant.
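
To make the linear versus geometric mean comparison concrete, the following is a minimal sketch of the two pooling rules for a two-alternative question. It assumes the weighted geometric mean is renormalized over the two alternatives; the report's own model equations are not repeated in this section, and the function names are hypothetical.

```python
import math


def linear_pool(probs, weights):
    """Weighted arithmetic mean of the members' probabilities for one answer."""
    return sum(w * p for w, p in zip(weights, probs))


def geometric_pool(probs, weights):
    """Weighted geometric mean, renormalized over the two alternatives.

    probs are the members' probabilities for the same answer; the
    complement 1 - p is pooled the same way and used to normalize.
    """
    num = math.prod(p ** w for w, p in zip(weights, probs))
    den = math.prod((1.0 - p) ** w for w, p in zip(weights, probs))
    return num / (num + den)
```

For example, with members at .9 and .6 and equal weights of .5, the linear pool is .75, while the geometric pool is larger, which is in line with the more extreme geometric mean results reported above.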

Calibration, another desirable feature of probabilities, also
showed some differences. Figure 1 shows the calibration curves for
group and individual probabilities, both before and after interaction.
The group probabilities are aggregated over both the linear and
geometric mean models, all three weighting procedures, and all interaction
types. Group probabilities are clearly better calibrated than individual
probabilities before interaction, but interaction causes the
calibration of the group probabilities to get worse while improving
the calibration of the individual probabilities.
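
A calibration curve of the kind discussed here can be tabulated as in the sketch below. The bin width (ten equal bins) and the function name are assumptions for illustration; the report does not state its binning in this section.

```python
from collections import defaultdict


def calibration_curve(assessed, correct, n_bins=10):
    """Tabulate a calibration curve for discrete assessments.

    assessed: probabilities assigned to the chosen answer.
    correct:  1 if the chosen answer was right, else 0.
    Returns a list of (mean assessed probability, proportion correct, count),
    one entry per non-empty bin, in increasing order of probability.
    """
    bins = defaultdict(list)
    for p, hit in zip(assessed, correct):
        # Small epsilon guards against floating-point error at bin edges.
        idx = min(int(p * n_bins + 1e-9), n_bins - 1)
        bins[idx].append((p, hit))
    curve = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        prop = sum(hit for _, hit in items) / len(items)
        curve.append((mean_p, prop, len(items)))
    return curve
```

Perfect calibration corresponds to the proportion correct equaling the mean assessed probability in every bin.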

Neither  weighting  procedures  nor  type  of  interaction  had  any 
notable  effect  on  calibration,  so  the  calibration  curves  for  the 
aggregation  models  shown  in  Figure  2 before  interaction,  and  Figure  3 
after  interaction  are  aggregated  over  those  variables.  The  linear 
model  leads  to  quite  well-calibrated  probabilities  before  interaction, 
while  the  likelihood  ratio  model  produces  very  poor  calibration. 

The use of the pari-mutuel model for aggregating individual
probabilities had to be limited for cost reasons. To aggregate the
probabilities of all groups for all questions using all weighting procedures
would have required over 100 hours of CPU time. To reduce this
computation to a more realistic level, one of the ten groups was randomly
selected and the pari-mutuel model was used to aggregate the individual
assessments of that group. Table 3 gives the mean quadratic scores and mean
probabilities assigned to the correct response for the assessments of this
group only. The pari-mutuel model generally produced lower scores and
less extreme probabilities than the linear or geometric mean models. The
Figure 1

INDIVIDUAL VERSUS GROUP CALIBRATION: DISCRETE ASSESSMENTS

[Ordinate: Proportion Correct]

Figure 2

CALIBRATION OF DIFFERENT AGGREGATION MODELS BEFORE
INTERACTION: DISCRETE ASSESSMENTS

[Abscissa: Assessed Probability; ordinate: Proportion Correct]

Figure 3

CALIBRATION OF DIFFERENT AGGREGATION MODELS AFTER
INTERACTION: DISCRETE ASSESSMENTS

[Abscissa: Assessed Probability]
TABLE 3(a)

AVERAGE QUADRATIC SCORES FOR SINGLE RANDOMLY SELECTED GROUP

                      Before Interaction           After Interaction
                           Weights                      Weights
Model             Equal  Self-rating  DeGroot   Equal  Self-rating  DeGroot
Linear            .612      .643       .618     .658      .668       .663
Geometric Mean    .639      .638       .636     .673      .676       .674
Pari-Mutuel       .546      .596       .556     .605      .630       .608

TABLE 3(b)

AVERAGE PROBABILITY ASSIGNED TO CORRECT ANSWER FOR
SINGLE RANDOMLY SELECTED GROUP

                      Before Interaction           After Interaction
                           Weights                      Weights
Model             Equal  Self-rating  DeGroot   Equal  Self-rating  DeGroot
Linear            .578      .622       .586     .634      .650       .643
Geometric Mean    .626      .654       .631     .656      .666       .661
Pari-Mutuel       .532      .576       .539     .588      .611       .594
relatively higher scores and more extreme probabilities using
self-rating weights are apparently an anomaly of this particular group.

Figure 4 shows the pari-mutuel calibration curves for this
group, along with the calibration curves of the linear model for
reference. Given the irregularity of the curves and the small samples
on which they are based (each point represents about 40 to 60
assessments), the calibration resulting from the use of the pari-mutuel
model does not appear to be systematically different from that of the
linear model.

Since assessments tended to become more extreme after interaction,
some of the factors that might affect changes in probability assessments
were examined. Four types of qualitative changes were considered:
switches to the other answer; less extreme assessments; no change;
and more extreme assessments. The factors considered were the split of
initial individual answers (all agree (4-0), 2-2, and 3-1, the last for
both the three individuals and the single individual); the type of
interaction; the individual's probability relative to those given by
group members selecting the other answer; and the individual's
probability relative to those given by group members selecting the same
answer. The latter two factors were divided into three categories:
larger than all the other probabilities, between or equal to the other
probabilities, or smaller than all the other probabilities. Table 4
presents the conditional percentages of changes for the given levels of
each of these factors.
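
The classification described above can be sketched as follows. The data representation (an answer label paired with the probability assigned to that answer) and the function names are hypothetical, chosen only to illustrate the four change types and the three relative-position categories.

```python
def classify_change(before, after, tol=1e-9):
    """Classify a change in one individual's discrete assessment.

    before, after: (answer, probability) pairs, where probability is the
    value assigned to the chosen answer.
    """
    answer_before, p_before = before
    answer_after, p_after = after
    if answer_after != answer_before:
        return "switch"
    if p_after > p_before + tol:
        return "more extreme"
    if p_after < p_before - tol:
        return "less extreme"
    return "no change"


def relative_position(p, others):
    """Place p relative to other members' probabilities (same or other answer)."""
    if not others:
        return None  # no members in the comparison set, e.g. a 4-0 split
    if p > max(others):
        return "larger"
    if p < min(others):
        return "smaller"
    return "between or equal"
```

Cross-tabulating the output of `classify_change` against `relative_position` and the initial split would give conditional percentages of the kind Table 4 reports.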

Changes generally display the intuitively expected patterns.
The more other group members agree with an individual, the less likely
that individual is to switch answers, and the more likely the judgment is
to become more extreme. Switches are more likely for individuals with
probabilities smaller than those of both the members with whom they agree
and the members with whom they
Figure 4

CALIBRATION OF SINGLE RANDOMLY SELECTED GROUP INCLUDING PARI-MUTUEL MODEL:
DISCRETE ASSESSMENTS
TABLE 4

CONTINGENCY TABLES FOR CHANGES IN INDIVIDUAL JUDGMENTS
(Percentages)

[Table layout not recoverable from the source.
Legible labels: Change (More Extreme); Type of Interaction (Delphi, MIX),
Marginal Means; Compared to Probabilities for Other Answer (Larger,
Between or Equal, Smaller), Marginal Means; Compared to Probabilities for
Same Answer (Larger, Between or Equal, Smaller), Marginal Means.
Legible values, in source order: 25.8, 41.9, 41.8, 33.3, 37.0, 35.1,
29.0, 30.9, 33.8, 36.5, 45.7, 28.4, 28.3, 26.2, 23.1, 36.3, 32.5, 29.9,
45.5, 36.2, 19.8, 9.4, 46.3, 53.6, 34.4, 40.0.]
disagree. Among individuals who agree, there is a tendency toward
averaging, with the largest assessments remaining the same or getting
less extreme, while the middle assessments remain the same or become
more extreme, and the smallest assessments become more extreme.

These aggregated tables mask some of the more striking
effects. For example, with a 4-0 split, 92 percent of the subjects with
the smallest assessments became more extreme. With a 3-1 split, 68
percent of the single individuals switched if their probability was
smaller than those of the individuals with whom they disagreed, while
only 20 percent switched if their probability was larger. These tables
also conceal a non-intuitive interaction: among the three agreeing
individuals in a 3-1 split, individuals with assessments less than or
equal to the assessment of the individual who disagreed were more likely
to switch if their assessment was larger than the assessments of the
agreeing individuals (32%) than if it was equal to or between (22%) or
smaller (21%).

Overall, interaction did produce a convergence in judgments.
The standard deviations of individual judgments were reduced by an average
of 25 percent, 26 percent, 27 percent, and 53 percent after Delphi,
MIX, NGT, and CON interactions, respectively.

Continuous  assessments.  Table  5(a)  shows  the  mean  scores  of 
both  individual  and  group  assessments  with  the  various  aggregation  models, 
weighting  procedures,  and  types  of  interaction.  The  two  aggregation 
models  are  the  linear  model  and  the  conjugate  model  with  weights  summing 
to  one.  The  linear  model  group  probabilities  were  calculated  by  averaging 
the  distributions  at  each  step  of  5 percent.  All  scores  were  computed 
by  assuming  a linear  cumulative  distribution  between  each  5 percent  step. 
These  approximations  were  necessary  to  reduce  computation  time. 
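
The 5 percent step approximation can be sketched as below. The sketch assumes that "averaging the distributions" means a weighted average of cumulative distribution values tabulated at 0, .05, ..., 1.0; that reading, and all names, are assumptions rather than the report's stated procedure.

```python
N_STEPS = 20
STEP = 1.0 / N_STEPS
GRID = [k * STEP for k in range(N_STEPS + 1)]  # 0, .05, ..., 1.0


def pool_cdfs(cdfs, weights):
    """Linear pool of CDFs, each tabulated at the 21 grid points."""
    return [sum(w * cdf[k] for w, cdf in zip(weights, cdfs))
            for k in range(N_STEPS + 1)]


def cdf_at(cdf, x):
    """CDF value at x, assuming the CDF is linear between grid points."""
    if x <= 0.0:
        return cdf[0]
    if x >= 1.0:
        return cdf[-1]
    k = min(int(x * N_STEPS), N_STEPS - 1)
    frac = x * N_STEPS - k
    return cdf[k] + frac * (cdf[k + 1] - cdf[k])


def density_at(cdf, x):
    """Density at x implied by the piecewise-linear CDF (constant per step)."""
    k = min(max(int(x * N_STEPS), 0), N_STEPS - 1)
    return (cdf[k + 1] - cdf[k]) / STEP
```

In practice each member's tabulated CDF would come from the assessed beta distribution; a uniform distribution corresponds to `cdf = GRID` and has a density of 1 everywhere.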
TABLE 5(a)

AVERAGE SCORES FOR CONTINUOUS ASSESSMENTS

               Linear                 Conjugate             Actual
         Equal  Self-   DeGroot   Equal  Self-   DeGroot    Consen-  Indi-
                rating                   rating             sus      vidual
Before    .005  -.011    .003     -.058  -.080   -.064               -.186
After    -.063  -.077   -.072     -.112  -.120   -.119      -.016    -.035
Delphi   -.036  -.057   [other entries illegible except              -.011
                        -.079, column unclear]
MIX      -.094  -.100   [other entries illegible except              -.114
                        -.172, column unclear]
NGT      -.012  -.026   -.016     [-.078 and -.072 legible;           .006
                                  one entry lost, columns unclear]
CON      -.112  -.125   -.123     -.146  -.158   -.155               -.022

TABLE 5(b)

AVERAGE DENSITY FOR CORRECT ANSWER

               Linear                 Conjugate             Actual
         Equal  Self-   DeGroot   Equal  Self-   DeGroot    Consen-  Indi-
                rating                   rating             sus      vidual
Before   1.71   1.72    1.74      2.05   [?]     2.07                1.73
After    2.03   1.99    1.99      2.08   [?]     2.09       2.20     2.00
Delphi   2.20   1.98    1.98      2.15   2.04    2.14                2.01
MIX      1.84   1.87    1.83      1.85   1.93    1.84                1.85
NGT      1.97   1.96    2.00      2.09   2.10    2.12                1.99
CON      2.13   2.13    2.15      2.23   2.23    2.24                2.16

[?] = entry illegible in the source.

TABLE 5(c)

AVERAGE IQ RANGE FOR CONTINUOUS ASSESSMENTS

               Linear                 Conjugate             Actual
         Equal  Self-   DeGroot   Equal  Self-   DeGroot    Consen-  Indi-
                rating                   rating             sus      vidual
Before   .266   .254    .254      [.126, .124, .140 legible; one entry
                                  lost, columns unclear]
After    .157   .152    .152      .113   [?]     .113       .096     .115
Delphi   [?]    [?]     [?]       .113   .113    .113                [?]
MIX      [?]    [?]     [?]       .118   .116    .119                [?]
NGT      [?]    .157    .156      .117   .114    .115                [?]
CON      .121   .120    .119      .104   .104    .105                [?]

[?] = entry illegible in the source.
Subjects did not generally do well on this task, as shown by the
negative scores, i.e., scores worse than that obtained with a uniform
distribution. Correlations between the individual assessed modes and the
true values were only .37 before interaction and .50 after interaction,
at best a moderate relationship. In addition, the assessed values of
n and the error measured by the absolute difference between the mode and
the true value were not related (r = -.06 before interaction and -.03
after interaction).

Interaction lowered the scores of the group probabilities,
F(1,9) = 6.19, p ≤ .035, while raising those of individuals. In fact,
after interaction, the individuals had higher scores than the groups. The
best scores were received by the actual consensus judgments. However,
inspection of the means indicates the differences are rather trivial in
size. The repetition by model interaction was also significant,
F(1,9) = 32.3, p < .001, but again the differences were rather small. As
for discrete assessments, neither the type of interaction nor the weights
used made a difference in the scores.

Table 5(b) shows the mean densities at the true values. Again,
the assessments became more extreme after interaction, F(1,9) = 13.3,
p ≤ .005, and aggregation with the conjugate model leads to higher
densities, F(1,9) = 130.0, p ≤ .001. The interaction between these factors
was also significant, F(1,9) = 34.0, p ≤ .001.

Another  characteristic  of  the  continuous  probability  assessments 
that  reflects  their  extremeness,  but  not  necessarily  their  accuracy,  is 
their  dispersion.  Table  5(c)  shows  the  mean  values  of  one  measure  of 
dispersion,  the  interquartile  (IQ)  range.  Group  distributions  derived 
with the linear model obviously have larger dispersion than those from
the conjugate model, F(1,9) = 161.2, p ≤ .001. And interaction
considerably reduces the IQ range, F(1,9) = 164.2, p ≤ .001, particularly
with the linear model (repetition x aggregation model interaction,
F(1,9) = 130.2, p ≤ .001). Additionally, the CON interaction produced
smaller IQ ranges (type of interaction main effect, F(3,27) = 5.51,
p ≤ .004), indicating generally more agreement as a result of this type of
interaction. Here, surprisingly, the weights also made a significant,
F(2,18) = 8.73, p ≤ .002, although not substantial, difference: equal
weights produced more dispersed distributions. All the two-way
interactions among repetitions, aggregation model, and weights were
significant, although all except the repetition by aggregation model
interaction were relatively less substantial than the main effects.

How well calibrated are the continuous assessments? Figure 5
shows the calibration curves for the individual assessments before
interaction and the group assessments both before and after interaction,
aggregated across weighting procedures, linear and conjugate models, and
all types of interaction. Individual calibration after interaction is not
shown because it differs little from the calibration before interaction
(maximum vertical difference in curves = .013). These curves plot the
percentage of true values (ordinate) falling below the specified value of
the cumulative distribution (abscissa). Perfect calibration would result
in a straight line from (0,0) to (1,1). The specific percentages of true
values falling in the tails (less than .01 or greater than .99) and in the
IQ range of the distributions are tabulated in the figure. These values
are often used to measure the calibration of continuous assessments when
the entire distributions are not assessed. All the distributions tend to
be too tight with
[Figure 5. Ordinate: Proportion of True Answers Less Than Assessed
Cumulative; abscissa: Assessed Cumulative Probability]

                                  Tails    IQ Range
Individual Before Interaction     25.1%     28.7%
Individual After Interaction      24.9%     27.6%
Group Before Interaction          15.1%     40.3%
Group After Interaction           22.1%     31.7%
too many true values in the tails and too few in the IQ ranges. The
group distributions, however, are better calibrated than the individual
distributions, although, as with the discrete probabilities, interaction
leads to poorer calibration for the group distributions. Also,
interestingly, the curves are not symmetric: the assessed distributions
are displaced to the left of the true value more often than to the right.
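
The tail and IQ range percentages can be computed from the assessed cumulative probability at each true value, as in this minimal sketch. The function name and the returned asymmetry measure (share of true values above the assessed median) are assumptions added for illustration.

```python
def continuous_calibration(cdf_at_truth):
    """Summarize calibration of continuous assessments.

    cdf_at_truth: for each assessment, the assessed cumulative probability
    evaluated at the true value. Returns the share of true values in the
    tails (below .01 or above .99), in the interquartile range (.25 to .75),
    and above the assessed median.
    """
    n = len(cdf_at_truth)
    tails = sum(1 for u in cdf_at_truth if u < 0.01 or u > 0.99) / n
    iq_range = sum(1 for u in cdf_at_truth if 0.25 <= u <= 0.75) / n
    above_median = sum(1 for u in cdf_at_truth if u > 0.5) / n
    return tails, iq_range, above_median
```

Perfectly calibrated distributions would put about 2 percent of true values in the tails and 50 percent in the IQ range. A share above the median well over 50 percent would correspond to distributions displaced to the left of the true value.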

The type of interaction again had little effect on calibration:
percentages of true values in the tails ranged from 18 percent for NGT
groups to 27 percent for CON groups, and IQ range percentages ranged
from 29 percent for CON groups to 35 percent for Delphi. But as shown
in Figures 6 and 7, the aggregation model did affect calibration both
before and after interaction. The calibration curves are plotted only
for equal weights since the curves for other weighting procedures are
very similar (see Figures for maximum discrepancies). The group
probabilities derived with the linear model are clearly better calibrated
than those from the conjugate model. In fact, before interaction the
linear model probabilities are very well calibrated, except for a slight
underestimation displacement. Otherwise, the group distributions are
too tight (too many true values in the tails and too few in the IQ range)
and all are generally displaced to the left (underestimation).

Analyses  were  not  performed  on  distributions  resulting  from 
aggregation  with  the  conjugate  model  and  weights  summing  to  four,  the 
number  of  individuals  in  the  group.  The  result  of  larger  weights 
would  be  only  to  decrease  considerably  the  dispersion  of  the  already 
too  tight  distributions  without  changing  the  accuracy  (as  measured  by 
Figure 6

CALIBRATION OF DIFFERENT AGGREGATION MODELS AND WEIGHTING PROCEDURES
BEFORE INTERACTION: CONTINUOUS ASSESSMENTS

Model       Weights        Tails    IQ Range   Discrepancy
Linear      Equal           5.0%     52.4%        --
Linear      Self-rating     5.3%     50.5%       .027
Linear      DeGroot         5.6%     50.6%       .013
Conjugate   Equal          23.7%     29.8%        --
Conjugate   Self-rating    25.3%     29.4%       .015
Conjugate   DeGroot        25.2%     29.3%       .007
[Abscissa: Cumulative Probabilities]

Figure 7

CALIBRATION OF DIFFERENT AGGREGATION MODELS AND WEIGHTING PROCEDURES
AFTER INTERACTION: CONTINUOUS ASSESSMENTS

Model       Weights        Tails    IQ Range
Linear      Equal          16.1%     34.4%
Linear      Self-rating    16.8%     35.0%
Linear      DeGroot        17.3%     34.2%
Conjugate   Equal          27.1%     28.6%
Conjugate   Self-rating    28.1%     29.2%
Conjugate   DeGroot        27.9%     28.6%

[A "Maximum" discrepancy column follows in the source; its values are
illegible.]
When questioned at the end of the experiment as to which
procedure they would prefer to use in a real decision making situation,
subjects exhibited a clear preference for interaction with some open,
face-to-face discussion. Twenty subjects preferred the NGT procedure,
19 the CON procedure, and 1 the MIX procedure.


DISCUSSION 


As is the case with all studies of the size and complexity of
this one, some "significant" results can always be teased out of the
data. Rather than focusing on particular significant results, or
nonsignificant ones for that matter, I will discuss some fairly general
conclusions and their implications for assessing group probabilities in
actual decision making contexts.

The results of this study can be viewed from two perspectives.
From the psychological viewpoint, the results are relatively
uninteresting. The type of interaction groups are allowed seems to have
little effect on subsequent judgments such as those in this study,
although all types produce some effects. But the implications for
applications of decision theory are important: use simple, mathematical
aggregation procedures. Simple procedures, such as combining individual
probability assessments linearly with equal weights, produce group
assessments that are as good as or better than those produced by more
complicated procedures involving interaction or complex aggregation
models. Interaction among the assessors produces only a feeling of
satisfaction, and not any overall improvement in the quality of the
assessed probabilities. Naturally, the results of this study are not as
simple and straightforward as these two viewpoints imply, but they do
capture the spirit of this research.

These conclusions are not new or unique to this research. Fischer
(1975) concurs with the lack of effect on group probabilities due to the
type of interaction, and Gough (1975) presents results that appear to
support this lack of effect, although he does not explicitly adopt
such a position. Dalkey (1969b), Gustafson et al. (1973), and Van de
Ven and Delbecq (1974) have argued in favor of specific interaction
procedures: Dalkey for Delphi, and Gustafson et al. and Van de Ven and
Delbecq for NGT. But Dalkey's conclusion is supported by very weak
evidence, and the latter two studies rely on suspect evaluation criteria.

On the model side of the question, the literature indicating
little difference in aggregation models due to weighting procedures is
becoming extensive (cf. Dawes and Corrigan, 1974; Wainer, 1976). And
in contexts other than aggregating probabilities (e.g., multiattribute
utility models), linear models have been shown to produce results quite
similar to those of non-linear models (Fischer, 1972; Newman, Seaver,
and Edwards, 1976). This study has confirmed the lack of effect of
different weighting schemes and, at least in the case of discrete
assessments, the similarity of results from linear and multiplicative
aggregation models for the particular case of aggregating individual
probabilities to form a group probability.

The result of interaction among assessors is quite clear for
both discrete and continuous assessments: it produces more extreme
and less well calibrated assessments. If all of the members of the
group agree on an answer, or if even three agree, the individual
assessments tend to become more extreme. Apparently, subjects treat
the information provided by other group members' assessments as
somewhat independent of their own information, rather than redundant. With
the particular type of questions and subjects used in this study, this
assumption of independence is probably unwarranted, as shown by an
analysis of the data. Since the sum of weights in the multiplicative
aggregation model (eq. 5) can be used as an indication of the degree of
independence of the individual assessments, the initial individual
discrete assessments were aggregated for each group by the multiplicative
model with the sum of the weights varying from 1.0 to 4 in steps of .1 and
from 4 to 10 in steps of .5. The aggregated assessments were scored with
the quadratic scoring rule. The best average score was obtained with
the weights summing to 1.0, indicating little independent information
in the assessments of different individuals.
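
The sweep over the sum of the weights can be sketched as follows. Equation 5 is not reproduced in this section, so the multiplicative form below (a renormalized product of powers with equal weights), the quadratic score form, and all function names are assumptions for illustration only.

```python
import math


def multiplicative_pool(probs, total_weight):
    """Assumed multiplicative model: equal weights that sum to total_weight."""
    w = total_weight / len(probs)
    num = math.prod(p ** w for p in probs)
    den = math.prod((1.0 - p) ** w for p in probs)
    return num / (num + den)


def quadratic_score(p_correct):
    """Assumed quadratic rule: 2*p_c - (p_c**2 + (1 - p_c)**2)."""
    return 2.0 * p_correct - (p_correct ** 2 + (1.0 - p_correct) ** 2)


def weight_sums():
    """1.0 to 4.0 in steps of .1, then 4.5 to 10.0 in steps of .5."""
    fine = [round(1.0 + 0.1 * k, 1) for k in range(31)]
    coarse = [4.5 + 0.5 * k for k in range(12)]
    return fine + coarse


def best_weight_sum(questions):
    """questions: per question, the members' probabilities for the correct answer."""
    def average_score(total_weight):
        scores = [quadratic_score(multiplicative_pool(q, total_weight))
                  for q in questions]
        return sum(scores) / len(scores)
    return max(weight_sums(), key=average_score)
```

A weight sum of 1.0 leaves a unanimous group's pooled probability equal to the members' common value, whereas larger sums push it toward 0 or 1, which only pays off when the members' information is not redundant.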

Situations where different assessors can be expected to possess
somewhat independent information clearly cannot be assumed to produce
results similar to those of this study. More extensive modeling may be
required in such situations, unless subsequent research shows some type
of interaction can be beneficially used. But practical considerations
can be used to guide selection of a procedure for determining a group
probability when there is no a priori rationale for distinguishing among
multiple assessors. Use of a simple mathematical model to aggregate
initial individual assessments rather than any type of interaction can
lead to considerable savings in time and effort on the part of decision
makers or other experts. Linear aggregation is particularly attractive
because of its computational simplicity, which makes it easily understood
and, therefore, possibly more acceptable. However, simple mathematical
aggregation of any sort may not be an acceptable procedure to decision
makers. As indicated by the subjects in this study who overwhelmingly
preferred some type of interaction with open face-to-face discussion,
procedures involving interaction may be desired. If this is true, the
NGT procedure would appear to be the procedure of choice. Although it
was generally not "significantly" better than other procedures in this
study, it was somewhat better, as it has been in other studies (Gough,
1975; Gustafson et al., 1973).

Snapper and Seaver (1978) provide an example of a situation
where mathematical aggregation is a preferable alternative to an
interactive process. As part of the evaluation of a national criminal
justice program, probabilistic judgments about expected program outcomes
are being obtained from experts. Simply averaging these judgments rather
than bringing the experts together to interact reduces the logistical
complexity and the cost of obtaining the judgments and, as shown by
this study, does so with no real loss in the quality of the resulting
probability assessments.